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AMENDMENTS TO THE CLAIMS 



This listing of claims replaces all prior versions and listings of claims In the 
application. 
In the claims: 

1, (Currently amended) A computer-implemented method of identifying table data in a 
document comprising the steps of; 

receiving a page description language representation of the document for 
providing a list of words in the document and position information for the words; and 

automatically identifying table data in the document based on the page 
description language representation of the document and at least one table identifying 
feature, wherein the identifying step includes, 

dividing the document into one or more pages; 
dividing each page into a plurality of lines; 
for each line, clustering the words of the line into one or more word 
clusters, wherein each cluster includes one or more words, each cluster having a 
horizontal beginning point, horizontal midpoint, and horizontal end point; 

for clusters in the plurality of lines, comparing alignme nt of the horizontal 
beginning point, horizontal midpoint, and horizontal end po int of clusters between 
lines, wherein a duster in a first line is aligned with a cluster in a previous line if 
at least one of the horizontal beginning point horizontal midpoint, a nd horizontal 
end point of the cluster in the first line is aligned with at least one of the 
horizontal beginning point horizontal midpoint, and horizontal end point of the 
cluster in the previous line: and 

identifying a line as being part of a table in response t o more than one 
cluster of the line being aligned with clusters of previous lines identified as part of 
the table: and 

automatica ll y i d e nt i fying tab le data in the docum e nt b a s e d on th e numbor 
of word c l ust e rs for oach lin e and tho a li gnment of tho horizontal b e ginn i ng 
po i nts, hor i zonta l m i dpoints, and horizontal ond points of word dusters between 

in iv«« 
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outputtinci data descriptive of the lin es of the table. 

2. (cancelled) 

3. (previously presented) The method of Claim 1 wherein the step of automatically 
identifying table data in the document based on the number of word clusters of each 
line and the alignment of the word clusters between lines further comprises: 

using the word clusters to generate column position information, wherein the 
column information includes for each column a horizontal beginning point, horizontal 
midpoint, and horizontal end point; and 

updating the column position information by performing a union operation 
between the column position information of a previous line and the column position 
information of a current line. 

4. (cancelled) 

5. (previously presented) The method of Claim 1 wherein receiving a page description 
language representation of the document for providing a list of words in the document 
and position information for the words includes receiving a PDF representation of the 
document, and wherein converting the table data encompassed by each table bounding 
box to a markup language representation includes converting the table data 
encompassed by each table bounding box to a HTML representation. 

6. (cancelled) 

7. (currently amended) A computer-readable medium having stored thereon sequences 
of instructions, said sequences of instructions including instructions which, when 
executed by a processor, cause said processor to perform the steps of: 

receiving a page description language representation of a document for 
providing a list of words in the document and position Information for the words; and 
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automatically identifying table data in the document based on the page 
description language representation of the document and at least one table identifying 
feature, wherein identifying includes, 

dividing the document into one or more pages; 

dividing each page into a plurality of lines; 

for each line, clustering the words of the line into one or more word 
clusters, wherein each cluster includes one or more words, each cluster having a 
horizontal beginning point, horizontal midpoint, and horizontal end point; and 

for clusters in the plura lity of lines, comparing alignment of the horizontal 
heainnina point, horizontal midpoint, and horizontal end point of clu sters between 
lines, wherein a cluster in a first line is aligned wit h a cluster in a previous line if 
at least one of the horizontal beginning point, horizontal midpoint, and horizontal 
end point of the cluster in the first line is aligned with at least one of the 
horizontal beginning point, horizontal midpoint, and hori zontal end point of the 
cluster in the previous line: and 

identifying a line as being part of a table in response to more than one 
cluster of the line being aligned with clusters of previous lines identified as part of 
the table; and 

automotioa ll y i dentify i ng tab l o data i n tho do c ument baood on tho numb e r 
of word clust e rs for e ach l in e and the a l ignm e nt of horizonta l b e g i nn i ng point s , 
horizontal m i dpo i nt s , and hor i zonta l ond po i nts of th e word clusters betw ee n 

f |i 1W 

outputting data descriptive of the lines of the table . 

8. (cancelled) 

9. (previously presented) The computer-readable medium of Claim 7 further containing 
instructions which, when executed by said processor, would cause said processor to 
perform the steps of: 
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using the word clusters to generate column position information, wherein the 
column information includes for each column a horizontal beginning point, horizontal 
midpoint, and horizontal end point; and 

updating the column position information by performing a union operation 
between the column position information of a previous line and the column position 
information of a current line. 

10. (cancelled) 

1 1 . (cancelled) 

12. (currently amended) A document processing system comprising: 

a processor for executing programs; and 

a table identification program for receiving a page description language 
representation of a document, the page description language representation providing a 
list of words in the document and position information for the words, and for 
automatically identifying table data in the document based on the page description 
representation of the document and at least one table identifying feature, wherein the 
identification program is configured to, 

divide the document into one or more pages; 

divide each page into a plurality of lines; 

for each line, cluster the words of the line into one or more word clusters, 
wherein each cluster includes one or more words, each cluster having a 
horizontal beginning point, horizontal midpoint, and horizontal end point; and 

for clusters in the plurality of lines, compare align ment of the horizontal 
beginning point, horizontal midpoint, and horizontal end po int of dusters between 
lines, wherein a cluster in a first line is aliened w ith a cluster in a previous line if 
at least one of the horizontal beginning point, horizon tal midpoint, and horizontal 
end point of the cluster in the first line is aliened with at leas t one of the 
horizontal beginning point, horizontal midpoint, a nd horizontal end point of the 
cluster in the previous line; and 
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identify a line as being part of a table in response to mo re than one cluster 
of the line being aliened with dusters of prev ious lines identified as part of ttje 
table; and 

outomatic a lly I dent i fy tabl o data i n tho documont bocod on tho number of 
word clusters for oach l ine and th o alignmont of horizonta l b e ginn i ng point& r 
hor izontal midpoints^ and horizontal o nd points of tho word Gluctoro botwoon 
lino s 

output data descriptive of the line s of the table. 

13. (cancelled) 

14. (cancelled) 

15. (previously presented) The document processing system of claim 12 wherein the 
table identification program further comprises: 

a conversion module coupled to the bounding box generation module for 
receiving the table bounding box for each table in the document, and for converting the 
words encompassed by the table bounding box into a markup language representation 
that maintains the table structure of each table. 

16. (previously presented) The method of claim 1 wherein the step of automatically 
identifying table data in the document based on the page description language 
representation of the document and at least one table identifying feature further 
comprises: 

automatically identifying table data in the document based on one or more table 
headings. 

17. (previously presented) The method of claim 1 wherein the step of automatically 
identifying table data in the document based on the page description language 
representation of the document and at least one table identifying feature further 
comprises: 
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automatically identifying table data in the document based on one or more 
horizontal lines and vertical lines that separate rows or columns of the table. 

1 8. (previously presented) The method of claim 1 , wherein the step of automatically 
identifying table data in the document based on the number of word clusters for each 
line and the alignment of the word clusters comprises: 

determining whether the number of word clusters in a line is greater than a 

threshold value; and 

classifying the word clusters in the line as a row of a table in response to the 
number of word clusters in a line being greater than the threshold value. 

19. (previously presented) The computer-readable medium of claim 7, wherein the 
instructions for automatically identifying table data in the document based on the 
number of word clusters for each line and the alignment of the word clusters include 
instructions that when executed by a processor cause the processor to perform the 
steps further comprising: 

determining whether the number of word clusters In a line is greater than a 

threshold value; and 

classifying the word clusters in the line as a row of a table in response to the 
number of word clusters in a line being greater than the threshold value. 

20. (previously presented) The document processing system of claim 12, wherein the 
table identification program is further configured to: 

determine whether the number of word clusters in a line is greater than a 
threshold value; and 

classify the word clusters in the line as a row of a table in response to the 
number of word clusters in a line being greater than the threshold value. 
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